Challenge_6_Jyoti Rani

challenge_6

air_bnb

Visualizing Time and Relationships

Author

Jyoti Rani

Published

August 24, 2022

library(tidyverse)
library(summarytools)

# The two id columns are both named "del" so they can be dropped in one step
Air_Bnb <- read_csv("_data/AB_NYC_2019.csv",
                    col_names = c("del", "name", "del", "host_name", "neighbourhood_group",
                                  "neighbourhood", "latitude", "longitude", "room_type", "price",
                                  "minimum_nights", "number_of_reviews", "last_review",
                                  "reviews_per_month", "calculated_host_listings_count",
                                  "availability_365"),
                    skip = 1) %>%
  select(!starts_with("del")) %>%   # remove the id columns
  drop_na(reviews_per_month)        # remove listings with no review rate

New names:
• `del` -> `del...1`
• `del` -> `del...3`
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl  (10): del...1, del...3, latitude, longitude, price, minimum_nights, num...
date  (1): last_review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Air_Bnb

# A tibble: 38,843 × 14
   name    host_…¹ neigh…² neigh…³ latit…⁴ longi…⁵ room_…⁶ price minim…⁷ numbe…⁸
   <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl> <chr>   <dbl>   <dbl>   <dbl>
 1 Clean … John    Brookl… Kensin…    40.6   -74.0 Privat…   149       1       9
 2 Skylit… Jennif… Manhat… Midtown    40.8   -74.0 Entire…   225       1      45
 3 Cozy E… LisaRo… Brookl… Clinto…    40.7   -74.0 Entire…    89       1     270
 4 Entire… Laura   Manhat… East H…    40.8   -73.9 Entire…    80      10       9
 5 Large … Chris   Manhat… Murray…    40.7   -74.0 Entire…   200       3      74
 6 BlissA… Garon   Brookl… Bedfor…    40.7   -74.0 Privat…    60      45      49
 7 Large … Shunic… Manhat… Hell's…    40.8   -74.0 Privat…    79       2     430
 8 Cozy C… MaryEl… Manhat… Upper …    40.8   -74.0 Privat…    79       2     118
 9 Cute &… Ben     Manhat… Chinat…    40.7   -74.0 Entire…   150       1     160
10 Beauti… Lena    Manhat… Upper …    40.8   -74.0 Entire…   135       5      53
# … with 38,833 more rows, 4 more variables: last_review <date>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, and abbreviated variable names ¹host_name,
#   ²neighbourhood_group, ³neighbourhood, ⁴latitude, ⁵longitude, ⁶room_type,
#   ⁷minimum_nights, ⁸number_of_reviews
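
The row count falls from 48,895 to 38,843 after `drop_na(reviews_per_month)`. A minimal sketch of how one might check which rows such a filter removes is shown here. It uses a small toy tibble (`toy_raw`, with made-up values that are not taken from the dataset) and assumes that a missing `reviews_per_month` goes together with a `number_of_reviews` of 0; that assumption would need to be confirmed against the real file.

```r
library(tidyverse)

# Toy stand-in for the raw listings (hypothetical values, not from AB_NYC_2019.csv)
toy_raw <- tibble(
  name              = c("A", "B", "C", "D"),
  number_of_reviews = c(9, 0, 45, 0),
  reviews_per_month = c(0.21, NA, 0.38, NA)
)

# Same filter as in the post: drop rows where reviews_per_month is missing
toy_clean <- toy_raw %>%
  drop_na(reviews_per_month)

# Cross-tabulate "has zero reviews" against "review rate is missing";
# if the two always coincide, only the TRUE/TRUE and FALSE/FALSE rows appear
na_check <- toy_raw %>%
  count(zero_reviews = number_of_reviews == 0,
        missing_rate = is.na(reviews_per_month))

na_check
nrow(toy_clean)
```

Running the same `count()` on the unfiltered data would show whether the dropped rows are exactly the listings without reviews.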

print(dfSummary(Air_Bnb, varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

Air_Bnb

Dimensions: 38843 x 14
Duplicates: 0

Variable                                  Stats / Values                        Freqs (% of Valid)      Missing
----------------------------------------  ------------------------------------  ----------------------  ----------
name [character]                          1. Home away from home                12 (0.0%)               6 (0.0%)
                                          2. Loft Suite @ The Box Hous          11 (0.0%)
                                          3. Private Room                       10 (0.0%)
                                          4. Brooklyn Apartment                 9 (0.0%)
                                          5. Cozy Brooklyn Apartment            8 (0.0%)
                                          6. New york Multi-unit build          8 (0.0%)
                                          7. Private room                       8 (0.0%)
                                          8. Beautiful Brooklyn Browns          7 (0.0%)
                                          9. Harlem Gem                         7 (0.0%)
                                          10. Hillside Hotel                    7 (0.0%)
                                          [ 38253 others ]                      38750 (99.8%)

host_name [character]                     1. Michael                            335 (0.9%)              16 (0.0%)
                                          2. David                              309 (0.8%)
                                          3. John                               250 (0.6%)
                                          4. Alex                               229 (0.6%)
                                          5. Sonder (NYC)                       207 (0.5%)
                                          6. Sarah                              179 (0.5%)
                                          7. Maria                              174 (0.4%)
                                          8. Daniel                             170 (0.4%)
                                          9. Jessica                            170 (0.4%)
                                          10. Anna                              160 (0.4%)
                                          [ 9876 others ]                       36644 (94.4%)

neighbourhood_group [character]           1. Bronx                              876 (2.3%)              0 (0.0%)
                                          2. Brooklyn                           16447 (42.3%)
                                          3. Manhattan                          16632 (42.8%)
                                          4. Queens                             4574 (11.8%)
                                          5. Staten Island                      314 (0.8%)

neighbourhood [character]                 1. Williamsburg                       3163 (8.1%)             0 (0.0%)
                                          2. Bedford-Stuyvesant                 3141 (8.1%)
                                          3. Harlem                             2206 (5.7%)
                                          4. Bushwick                           1944 (5.0%)
                                          5. Hell's Kitchen                     1532 (3.9%)
                                          6. East Village                       1490 (3.8%)
                                          7. Upper West Side                    1482 (3.8%)
                                          8. Upper East Side                    1405 (3.6%)
                                          9. Crown Heights                      1265 (3.3%)
                                          10. Midtown                           986 (2.5%)
                                          [ 208 others ]                        20229 (52.1%)

latitude [numeric]                        Mean (sd) : 40.7 (0.1)                17443 distinct values   0 (0.0%)
                                          min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9
                                          IQR (CV) : 0.1 (0)

longitude [numeric]                       Mean (sd) : -74 (0)                   13641 distinct values   0 (0.0%)
                                          min ≤ med ≤ max: -74.2 ≤ -74 ≤ -73.7
                                          IQR (CV) : 0 (0)

room_type [character]                     1. Entire home/apt                    20332 (52.3%)           0 (0.0%)
                                          2. Private room                       17665 (45.5%)
                                          3. Shared room                        846 (2.2%)

price [numeric]                           Mean (sd) : 142.3 (196.9)             581 distinct values     0 (0.0%)
                                          min ≤ med ≤ max: 0 ≤ 101 ≤ 10000
                                          IQR (CV) : 101 (1.4)

minimum_nights [numeric]                  Mean (sd) : 5.9 (17.4)                89 distinct values      0 (0.0%)
                                          min ≤ med ≤ max: 1 ≤ 2 ≤ 1250
                                          IQR (CV) : 3 (3)

number_of_reviews [numeric]               Mean (sd) : 29.3 (48.2)               393 distinct values     0 (0.0%)
                                          min ≤ med ≤ max: 1 ≤ 9 ≤ 629
                                          IQR (CV) : 30 (1.6)

last_review [Date]                        min : 2011-03-28                      1764 distinct values    0 (0.0%)
                                          med : 2019-05-19
                                          max : 2019-07-08
                                          range : 8y 3m 10d

reviews_per_month [numeric]               Mean (sd) : 1.4 (1.7)                 937 distinct values     0 (0.0%)
                                          min ≤ med ≤ max: 0 ≤ 0.7 ≤ 58.5
                                          IQR (CV) : 1.8 (1.2)

calculated_host_listings_count [numeric]  Mean (sd) : 5.2 (26.3)                47 distinct values      0 (0.0%)
                                          min ≤ med ≤ max: 1 ≤ 1 ≤ 327
                                          IQR (CV) : 1 (5.1)

availability_365 [numeric]                Mean (sd) : 114.9 (129.5)             366 distinct values     0 (0.0%)
                                          min ≤ med ≤ max: 0 ≤ 55 ≤ 365
                                          IQR (CV) : 229 (1.1)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-12-23

Briefly describe the data

The data are New York City Airbnb listings from 2019. To tidy the data, I removed the two id columns and dropped the rows with a missing reviews_per_month, which leaves 38,843 of the original 48,895 listings. The minimum number_of_reviews in the summary table is now 1, so listings without any reviews are no longer included. I will compare the number of reviews with other variables to see whether they are related, treating the number of reviews as either the independent or the dependent variable. Using the summary table as a guide, I could also investigate the number of reviews for each neighbourhood_group and neighbourhood. I will also look at the date variable, last_review, to see how the other values change over time.
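
The per-group comparison mentioned above could be sketched roughly as follows. This is only an illustration on a toy tibble (`toy_listings`, with hypothetical values); on the real data the same pipeline would start from `Air_Bnb` instead.

```r
library(tidyverse)

# Toy listings (hypothetical values) with the two columns the summary needs
toy_listings <- tibble(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Queens"),
  number_of_reviews   = c(9, 270, 45, 74, 12)
)

# One row per neighbourhood_group: listing count, total and median reviews
reviews_by_group <- toy_listings %>%
  group_by(neighbourhood_group) %>%
  summarise(
    listings       = n(),
    total_reviews  = sum(number_of_reviews),
    median_reviews = median(number_of_reviews),
    .groups        = "drop"
  ) %>%
  arrange(desc(total_reviews))

reviews_by_group
```

The median is included alongside the total because the summary table shows a strongly skewed number_of_reviews (median 9, maximum 629), so totals alone can be driven by a few heavily reviewed listings.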

Time Dependent Visualization

ggplot(Air_Bnb, aes(x = last_review, y = number_of_reviews)) +
  geom_line() + 
  labs(title = "Number of Reviews for Airbnb Listings", x = "Date of Last Review", 
     y = "Number of Reviews") + 
  theme_bw()

This graph plots the number of reviews against the date of the last review for every listing. It suggests that listings whose last review is closer to the present (mid-2019) tend to have more reviews than listings whose last review is older.
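
Because `geom_line()` connects observations in order of the x variable, a plot with one row per listing draws many vertical jumps wherever several listings share the same last_review date. One possible alternative, sketched here on a toy tibble (`toy_reviews`, with hypothetical values), is to aggregate by month first and plot one point per month. The choice of the median and of monthly bins is an assumption made for illustration, not the method used in the post.

```r
library(tidyverse)
library(lubridate)

# Toy data (hypothetical values) with the same two columns as the plot above
toy_reviews <- tibble(
  last_review       = as.Date(c("2019-05-03", "2019-05-21", "2019-06-02",
                                "2019-06-15", "2019-06-30")),
  number_of_reviews = c(9, 45, 270, 74, 49)
)

# Collapse to one row per month of last review
monthly <- toy_reviews %>%
  mutate(review_month = floor_date(last_review, unit = "month")) %>%
  group_by(review_month) %>%
  summarise(
    listings       = n(),
    median_reviews = median(number_of_reviews),
    .groups        = "drop"
  )

# Line chart of the monthly median
p_monthly <- ggplot(monthly, aes(x = review_month, y = median_reviews)) +
  geom_line() +
  geom_point() +
  labs(title = "Median Number of Reviews by Month of Last Review",
       x = "Month of Last Review",
       y = "Median Number of Reviews") +
  theme_bw()

p_monthly
```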

Visualizing Part-Whole Relationships

ggplot(Air_Bnb) + 
  geom_bin2d(mapping = aes(x = latitude, y = longitude))

ggplot(Air_Bnb) + 
  geom_point(mapping = aes(x = latitude, y = longitude, color = room_type))

Here I have two graphs that both plot the latitude and longitude of each listing. The first graph shows the density of the Airbnb locations; most listings are clustered around latitude 40.7. The second graph shows which room type is most common within those clusters. As longitude increases, the listings appear to be dominated by Private room. Because longitude increases toward the east, this suggests that Private rooms become more common as we move east.
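
A more direct part-whole view of room types is the share of each room_type within each neighbourhood_group. The sketch below computes those shares and draws them as stacked bars; it uses a toy tibble (`toy_rooms`, with hypothetical values) and is meant as one possible approach rather than the analysis behind the plots above.

```r
library(tidyverse)

# Toy listings (hypothetical values): one row per listing
toy_rooms <- tibble(
  neighbourhood_group = c("Brooklyn", "Brooklyn", "Brooklyn",
                          "Manhattan", "Manhattan", "Manhattan", "Manhattan"),
  room_type           = c("Private room", "Private room", "Entire home/apt",
                          "Entire home/apt", "Entire home/apt", "Entire home/apt",
                          "Private room")
)

# Share of each room type within its neighbourhood_group
room_shares <- toy_rooms %>%
  count(neighbourhood_group, room_type) %>%
  group_by(neighbourhood_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# Stacked bars: each bar sums to 100% of the listings in that group
p_shares <- ggplot(room_shares,
                   aes(x = neighbourhood_group, y = share, fill = room_type)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Room Type Share by Neighbourhood Group",
       x = "Neighbourhood Group",
       y = "Share of Listings",
       fill = "Room Type") +
  theme_bw()

p_shares
```

On the real data, replacing `toy_rooms` with `Air_Bnb` would give one bar per borough, which makes it easier to compare room type mixes than reading colors off a scatter plot.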